Ultracomputer  Packaging  and  Prototypes 

by 
Ronald  Bianchini 

Ultracomputer  Note  #152 

January,  1989 

(Revised  -  May,  1989) 


Ultracomputer  Research  Laboratory 


New  York  University 

Courant  Institute  of  Mathematical  Sciences 

Division  of  Computer  Science 

251  Mercer  Street,  New  York,  NY  10012 


This  work  was  supported  under  the  Department  of  Energy  grant  No.    DE-FG-02-88-ER25052. 


Table  of  Contents 


1.  Introduction

2.  Interconnection Networks

3.  Ultracomputer 'Button' Technology Modules

4.  16 to 256 PE Ultracomputers

5.  Larger Ultracomputers

6.  NYU Ultracomputer Prototypes

7.  Ultra III Prototype

Appendix

References


Abstract 

This  paper  reviews  interconnection  networks  and  describes  packaging  of  the 
NYU  Ultracomputer.  The  packaging  is  based  on  the  'button'  technology  developed  by 
TRW  and  adopted  by  MCC  for  its  ES-Kit  modules.  This  paper  presents  the  basic  ES-Kit 
modules  needed  to  assemble  several  sizes  of  Ultracomputers.  It  also  reviews 
Ultracomputer  prototypes  and  previews  future  prototypes. 

Figure 1.1.  Ultracomputer: processors connected through an interconnection network to memory modules (MM).


1.    Introduction 


The Ultracomputer is a shared memory multiprocessor computer (Figure 1.1).
Since the beginning of the NYU Ultracomputer project in 1979, packaging of the
interconnection network has been the fundamental limitation on the maximum realizable
size and performance of an Ultracomputer. "Wireability of an Ultracomputer" described
the wiring of a 4096 processor (PE) Ultracomputer (Figures 1.2 and 1.3) and showed how
a three dimensional wiring scheme for an omega interconnection network could produce
short interstage wires (Figures 1.4 and 1.5). "An Overview of the NYU Ultracomputer
Project" addressed the computational speeds and challenges of parallelism. This paper
presents a simple modular packaging scheme for an Ultracomputer based on the 'button'
technology developed by TRW for the VHSIC Phase 2 Packaging Project. MCC, the
Microelectronics and Computer Technology Corporation, adopted this technology for its
Experimental Systems (ES-Kit) modules. The goal of the ES-Kit research project is
rapid, low cost prototyping of parallel computing systems. The packaging scheme
described here is built on a simple set of Ultracomputer ES-Kit modules. Existing NYU
Ultracomputer prototypes are also presented, along with a preview of proposed
prototypes.


Figure 1.2.  4096 PE Ultracomputer (using 64X64 blocks). Note: there are 8 processors
and 8 memory modules per PE and MM board.

Figure 1.3.  4096 PE Ultracomputer (using 64X64 planes).

Figure 1.4.  16X16 omega network in two dimensions using 2X2 switches, and in three
dimensions using 2X2 switches on 4X4 boards.

Figure 1.5.  64X64 omega network in three dimensions using 2X2 switches on 8X8 boards
(64X64 blocks).


2.     Interconnection  Networks 

The Ultracomputer, a high performance shared memory multiprocessor
architecture, requires an interconnection network to connect the processors (PEs) to the
memories (MMs). The choice of network affects the performance and size of an
Ultracomputer. A Bus (Figure 2.1), as used in Ultra II (described later), is one possibility
for such a network. Its advantage is that the number of wires needed is proportional to
N, the number of PEs. Its limitations are the fixed bus bandwidth and the resulting loading.
Because of these limitations, a bus supports on the order of tens of PEs. A Cross Bar
switch with fanout and multiplexing capability at each module (Figure 2.2) is another
possible network. It has excellent connectivity and bandwidth, but requires order N^2
wires. The maximum number of processors buildable with such a network is about a
hundred.


Figure 2.1.  Bus interconnection network (wiring order N): PEs 1 to N and MEMs 1 to N
on a shared bus.

Figure 2.2.  Cross Bar interconnection network (wiring order N^2).


For thousands of processors an omega interconnection network is a choice to be
considered (Figure 2.3). This network has bandwidth proportional to N, wiring of order
N log N, and an access latency, ignoring queueing delays, proportional to log N. Network
combining of requests and processor caches help to offset the access latency. The
network switch is the critical element of this network architecture. It must route, queue
and combine requests. Fetch and Add, which is a combinable request and a required
operation of the Ultracomputer, must also be supported by the switch, so the switch must
perform arithmetic operations. Also, to minimize routing delays and permit one clock
latency through the switches with combining, packets and pinouts must be wide enough
to carry full address and request information simultaneously. For a 2X2 switch this
means about 40 signals, including flow control, in each direction on each of its four
ports, or about 320 signal pins per switch chip. A package must provide about 400 pins
to contain one 2X2 switch (Figure 2.4). At NYU a VLSI effort has been underway to
design this 2X2 switch. A modified Ultra II prototype is presently in operation with an
NYU noncombining VLSI 2X2 switch.
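
The sizing argument above can be checked with a few lines of arithmetic. The following Python sketch is illustrative only: the function names are hypothetical, and it assumes a square network whose size is an exact power of the switch radix and the estimate of about 40 signals per direction per port quoted in the text.

```python
def omega_size(n_pe, radix=2):
    """Stages, switches per stage and total switches of an n_pe x n_pe omega network."""
    stages, size = 0, 1
    while size < n_pe:              # integer loop avoids floating point logarithms
        size *= radix
        stages += 1
    if size != n_pe:
        raise ValueError("n_pe must be a power of the switch radix")
    per_stage = n_pe // radix
    return stages, per_stage, stages * per_stage


def switch_signal_pins(radix=2, signals_per_direction=40):
    """Signal pins for one switch chip, following the estimate in the text."""
    ports = 2 * radix                           # radix inputs plus radix outputs
    return ports * 2 * signals_per_direction    # two directions per port


if __name__ == "__main__":
    for n in (16, 64, 4096):
        print(n, "PEs:", omega_size(n))
    print("signal pins per 2X2 switch:", switch_signal_pins())
```

The total switch count grows as N log N, which is the wiring order stated above. With 2X2 switches, 16 PEs need 4 stages of 8 switches.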


Figure 2.3.  Omega interconnection network (wiring order N log N): PEs 1 to N
connected to MEMs 1 to N through log N stages.

Figure 2.4.  NYU 2X2 switch (VLSI): queue and route (packets), combine and uncombine
(wait buffers), Fetch and Add (arithmetic functions).


The following figures show how a large omega network can be packaged.
Figure 2.5 depicts several topologically equivalent 16X16 omega networks. The
interstage wiring patterns vary, but the only difference is in the placement of the 2X2
switches. Figures 2.6 and 2.7 show how the wiring patterns vary for a 64X64 omega
network depending on the grouping of 2X2 switches. For a VLSI implementation, tiling
4X4's to make a 64X64 (Figure 2.6) should produce a smaller geometry. Figure 2.8
shows the three dimensional packaging of these 64X64 networks. Figure 2.9
demonstrates how a stack of 4X4 switches, with inputs on top and outputs on bottom,
can be configured as a 64X64 omega network. This scheme requires routing channels
between the three layers and is described in detail in the next section.
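
For readers unfamiliar with the shuffle wiring in these figures, the sketch below traces a request through an omega network of 2X2 switches using the standard textbook formulation: a perfect shuffle before every stage, followed by a switch setting chosen by one bit of the destination address. The function names are illustrative and the code is not taken from the NYU switch design.

```python
def shuffle(line, n_bits):
    """Perfect shuffle: rotate the n_bits-bit line number left by one bit."""
    mask = (1 << n_bits) - 1
    return ((line << 1) | (line >> (n_bits - 1))) & mask


def route(src, dst, n_bits):
    """Return the line numbers visited when routing from input src to output dst."""
    path = [src]
    line = src
    for stage in range(n_bits):
        line = shuffle(line, n_bits)                 # interstage wiring
        bit = (dst >> (n_bits - 1 - stage)) & 1      # destination bit, most significant first
        line = (line & ~1) | bit                     # 2X2 switch: upper (0) or lower (1) output
        path.append(line)
    return path


if __name__ == "__main__":
    # 16X16 network: four stages of 2X2 switches
    print(route(5, 12, 4))
```

Because each stage shifts one source bit out and one destination bit in, every input reaches every output after log N stages, whatever the physical placement of the switches.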


Figure 2.5.  16 X 16 omega interconnection networks (shuffle, binary, 4 X 4's).

Figure 2.6.  64 X 64 omega interconnection network using 4 X 4's.

Figure 2.7.  64 X 64 omega interconnection network using 8 X 8's.

Figure 2.8.  3 dimension 64 X 64 switches (16 X 16's, 4 X 4's, 8 X 8's).

Figure 2.9.  3 dimension 64 X 64 switch stack.


3.     Ultracomputer  'Button'  Technology  Modules 

A simple modular packaging scheme for an Ultracomputer using the 'button'
technology developed by TRW will now be presented. Note: the Elastomeric Conductive
Polymer Interconnect (ECPI), recently developed at AT&T Bell Laboratories, exhibits
an equivalent three dimensional continuity property and could serve as a replacement
for the 'buttons' in what follows. The ECPI is a thin sheet of resilient silicone rubber
insulation embedded with densely populated conductors.

The 'button' technology has been designed to facilitate the interconnection of
stacked modules with separate sets of connections on the top and bottom faces of each
module. If the 'button' technology ES-Kit packaging system, developed by MCC, is
used, a simple set of 'Ultra' modules can be devised for the assembly of several sizes of
Ultracomputers.

Three basic module types are necessary to form a four processor Ultracomputer:
a processor (PE) module, a memory (MM) module, and a four input-four output switch
(SW) module. The basic ES-Kit module contains four groups of 106 connection pads on
both the top and bottom faces of the module, two groups on each end of the 7 inch side
of the 7 inch by 5.5 inch module. With discrete components an Ultracomputer module
will require a double size (7 inch by 10 inch) module (see Appendix for detailed drawings
of board and 'buttons'). For clarity in this description, it is assumed that coplanar
modules are centered on a square grid. The center-to-center distance serves as the unit
of length in what follows.




Both the PE and MM modules are wired so that their external connections are
located at one corner of the top face and are also wired to a corner of the bottom face
rotated by 90 degrees. Each of the remaining three top corners is wired separately to
the bottom corner rotated by 90 degrees from it (Figure 3.1). With this wiring scheme,
all external connections of four stacked PEs or MMs are brought to both the top and
bottom faces of the stack. A stack of up to four PEs and MMs forms a useful assembly
for diagnosing PE / MM operation. A single PE / MM combination is shown in
Figure 3.4.
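
The effect of the rotated corner wiring can be checked with a small model. The sketch below is an illustration under stated assumptions, not the actual board layout: corners are numbered 0 to 3 around a face, every top corner c is assumed to be wired to bottom corner c+1 (modulo 4), and each module taps its own external connections at top corner 0. The numbering and the direction of rotation are assumptions made for the example.

```python
CORNERS = 4  # corner connection groups per module face


def stack_faces(n_modules):
    """Map each corner of the stack's top and bottom faces to a module index.

    Module 0 is the top of the stack. Returns (top_face, bottom_face).
    """
    if not 1 <= n_modules <= CORNERS:
        raise ValueError("a stack holds one to four modules")
    top_face, bottom_face = {}, {}
    for module in range(n_modules):
        corner = 0                                  # own connections sit at top corner 0
        for _ in range(module):                     # climb through the modules above
            corner = (corner - 1) % CORNERS         # bottom corner c comes from top corner c-1
        top_face[corner] = module
        corner = 1                                  # own connections leave at bottom corner 1
        for _ in range(n_modules - 1 - module):     # descend through the modules below
            corner = (corner + 1) % CORNERS         # top corner c is wired to bottom corner c+1
        bottom_face[corner] = module
    return top_face, bottom_face


if __name__ == "__main__":
    top, bottom = stack_faces(4)
    print("top face:   ", top)
    print("bottom face:", bottom)
```

With four modules, each face ends up with one distinct module per corner, which is the property the text relies on. A fifth module would collide with the first, which is consistent with the limit of four modules per stack.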


Figure 3.1.  Top side and bottom side wiring of the processor module (PE) and memory
module (MM).

The  four  input-four  output  switch  (SW)  module  has  its  inputs  wired  to  each  of  the 
four  top  corners  and  its  outputs  to  the  corresponding  four  bottom  corners  (Figure  3.2). 
A  stack  of  nine  modules,  (4PEs,  1SW,  and  4MMs)  comprises  a  4PE  Ultracomputer 
(Figure  3.3  and  Figure  3.4).  This  combination  is  expanded  in  Figure  3.4  to  show  the 
'button'  technology  used  to  connect  adjacent  boards. 


Figure 3.2.  Switch module (SW).

Figure 3.3.  4 PE Ultracomputer: PEs, SW (4X4) and MMs.


Figure 3.4.  4 PE Ultracomputer (a stack of four PEs, one SW and four MMs), a single
PE / MM combination, and the 4X4 switch (SW), showing the button assembly.


4.     16  to  256  PE  Ultracomputers 

Incorporating simple wire routers with the three module types described above
permits still larger Ultracomputers to be assembled. Those easily built have sizes that
are successive powers of four, i.e. 4, 16, 64, 256, etc. The most straightforward
packaging for larger Ultras requires an additional component: routers that connect 4X4
switch stages. Two types of routers are added between stages, one vertical and the
other horizontal. These components have a small dimension of one unit and a large
dimension of two, four, eight, etc., units depending on the number of stages required for
a particular size Ultracomputer.

For  example,  to  build  a  16PE  Ultracomputer,  only  two  kinds  of  routers  are 
needed,  a  horizontal  router  (hR),  one  unit  high  by  two  units  wide,  and  a  vertical  router 
(vR)  two  units  high  by  one  unit  wide  (Figure  4.1).  Two  of  each  are  required.  The 
function  of  these  routers  is  simply  to  distribute  each  of  the  four  outputs  of  a  4X4  switch 
stage  to  one  input  of  each  of  four  different  4X4  switches  of  the  next  stage. 
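
The distribution performed by the routers can be written as a simple mapping. The sketch below treats the router layers between two stages as one logical permutation over a group of four switches. The particular choice of which input receives each wire (output j of switch i goes to input i of switch j) is an assumption made for illustration; the text only requires that the four outputs of a switch reach four different switches.

```python
def router_map(group=4):
    """Map (switch, output) of one stage to (switch, input) of the next stage."""
    return {(sw, out): (out, sw) for sw in range(group) for out in range(group)}


def next_stage_switches(mapping, sw, group=4):
    """Set of next-stage switches reached from the outputs of switch sw."""
    return {mapping[(sw, out)][0] for out in range(group)}


if __name__ == "__main__":
    wiring = router_map()
    for sw in range(4):
        print("switch", sw, "feeds switches", sorted(next_stage_switches(wiring, sw)))
```

Since every first-stage switch feeds all four second-stage switches, any of the 16 PEs can reach any of the 16 MMs through two 4X4 switch stages.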


Figure 4.1.  Horizontal router (hR) and vertical router (vR).


One  composition  of  a  16PE  Ultracomputer  would  have  four  stacks  each  with 
4PEs  and  one  SW,  on  a  layer  of  two  hRs,  followed  by  a  layer  of  two  vRs,  and  finally  four 
stacks  each  having  one  SW  and  4MMs  (Figures  4.2  and  4.3). 

Figure 4.2.  16 PE Ultracomputer (with hR and vR): top side and bottom side of the
processor module (PE), switch module (SW), horizontal router (hR), vertical router (vR),
switch module (SW) and memory module (MM).

Figure 4.3.  16 PE Ultracomputer (with hR and vR): PEs, SWs (4X4), horizontal router
(hR), vertical router (vR) and MMs.


It  is  also  possible  to  build  a  16PE  Ultra  with  only  one  long  router  between  the 
stages,  either  a  horizontal  router  (HR)  one  unit  high  by  four  units  long,  or  a  vertical  router 
(VR)  four  units  high  by  one  unit  long.  This  is  advantageous  for  16PE  and  larger 
Ultracomputers,  since  it  reduces  the  number  of  routers,  and  thickness  of  the  stack.  For 
the  following  16PE  implementation  only  one  layer  of  one  HR  router  is  required  (Figures 
4.4,  4.5  and  4.6).  Such  a  router  distributes  the  four  outputs  of  a  4X4  switch  to  one  input 
each  of  four  4X4  switches  of  the  next  stage.  Figure  4.6  shows  how  such  an  assembly  is 
physically  held  together  by  nuts  and  bolts.  The  nuts  and  bolts  also  facilitate  the 
electrical  connection  between  'buttons'  and  modules. 

Figure 4.4.  16 PE Ultracomputer (with HR): PEs, SWs (4X4), horizontal router (HR)
and MMs.

Figure 4.5.  16 PE Ultracomputer (with HR).

Figure 4.6.  16 PE Ultracomputer (with HR), showing buttons, nuts and bolts. Labeled
parts: PE, SW, MM, button, spacer, horizontal router (HR), nut, bolt.


In a 64PE Ultracomputer (Figures 4.7, 4.8 and 4.9), both the HR and VR routers
are needed, requiring a stack thickness of 13 (4PEs, SW, VR, SW, HR, SW and 4MMs).
Figure 4.8 shows the detailed wiring pattern of the HR or VR and the proposed Ultra III
packaging. Figure 4.9 depicts quantities and details of the components.


Figure 4.7.  64 PE Ultracomputer: PEs, SWs (4X4), vertical router (VR), horizontal
router (HR) and MMs.

Figure 4.8.  64 PE Ultra III package, showing SW and horizontal router (HR).

Figure 4.9.  64 PE Ultracomputer (components): 64 PE, 64 MM and 48 SW (4X4)
modules, with horizontal routers (HR) and vertical routers (VR).

A 256PE Ultracomputer will include another stage of 4X4 SWs and two larger
wire routing modules: a large vertical router (LVR), eight units high by one unit wide,
and a large horizontal router (LHR), one unit high by eight units wide. Four 64PE
Ultracomputers are grouped in a two by two array. The last 4X4 SW stage and 4MMs
are moved back and a stage of 4X4 SWs, a large vertical router, and a large horizontal
router are inserted into the stack. Its composition is 4PEs, SW, VR, SW, HR, SW, LVR,
LHR, SW and 4MMs, or 16 layers in a stack (Figure 4.10). An alternative stacking with
the same modules is 4PEs, SW, VR, SW, LVR, LHR, SW, HR, SW and 4MMs (Figure
4.11).
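
The stack compositions quoted in this section can be tabulated and checked mechanically. The sketch below is illustrative: the dictionary simply transcribes the layer orders given in the text (the 16 PE entry is the single HR variant), and the switch module count assumes one 4X4 SW module per four PEs in each stage.

```python
# Layer order, top to bottom, as listed in the text for each machine size.
STACKS = {
    16: ["PE"] * 4 + ["SW", "HR", "SW"] + ["MM"] * 4,
    64: ["PE"] * 4 + ["SW", "VR", "SW", "HR", "SW"] + ["MM"] * 4,
    256: ["PE"] * 4 + ["SW", "VR", "SW", "HR", "SW", "LVR", "LHR", "SW"] + ["MM"] * 4,
}


def thickness(n_pe):
    """Number of layers in one stack of an n_pe Ultracomputer."""
    return len(STACKS[n_pe])


def switch_modules(n_pe):
    """Total 4X4 SW modules: one per four PEs in every switch stage."""
    stages = STACKS[n_pe].count("SW")
    return stages * (n_pe // 4)


if __name__ == "__main__":
    for n in sorted(STACKS):
        print(n, "PEs:", thickness(n), "layers,", switch_modules(n), "SW modules")
```

For 64 PEs this gives the 13 layers stated above, and for 256 PEs the 16 layers stated above.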

Figure 4.10.  256 PE Ultracomputer, with detail of the LHR or LVR.

Figure 4.11.  256 PE Ultracomputer, alternative stacking.


5.     Larger  Ultracomputers 

Larger  Ultracomputers  are  possible  with  'button'  packaging.  It  should  be  noted 
that  the  longest  wire  in  simple  routers  is  half  the  length  of  the  router  and  also  that  large 
Ultracomputers  do  not  require  one  piece  long  routing  boards.  Instead,  these  boards  can 
be  made  of  layers  of  smaller  boards  and  'buttons'.  When  half-size  boards  are  used  to 
replace  a  full-size  board,  two  additional  'button'  layers  and  three  layers  of  overlapped 
half-size  boards  filled  with  either  longer  'button'  connectors  or  quarter-size  null  boards 
are  required  (Figure  5.1).  Similarly,  these  half-size  boards  can  in  turn  be  made  of 
quarter-size  boards.  This  process  ends  with  the  hR  or  vR.  Such  substitutions,  aside 
from  eliminating  the  requirement  for  large  boards,  reduce  the  density  of  wires  at  the 
center  of  the  remaining  boards,  but  at  the  cost  of  more  layers  of  'buttons'  and  boards. 

Figure 5.1.  Large routers (LHR or LVR) assembled using small routers (hR or vR).
Notes: Drawing is a cross section. On these routing boards, half the connections are
straight-through and are not shown.

1024 PE and 4096 PE configurations built with the above modules and routers will
be physically large, relatively thin Ultracomputers (Figures 5.2 and 5.3). Smaller
'button' boards with more connections and higher densities, coupled with more PEs,
MMs and larger omega networks per SW (utilizing 4X4, 8X8 or larger switch chips), will
produce physically smaller Ultracomputers. For example, if the number of connections
per 'button' board doubles, then an 8X8 omega network could be packaged on a SW
module. The larger omega network per SW module will then reduce the number of
inter-SW wire router boards required for a given size Ultracomputer.

The complexity of packaging and wiring omega interconnection networks will
always limit the size of an Ultracomputer; however, it will still be possible to build them
with thousands of processors.

Figure 5.2.  1024 PE Ultracomputer.

Figure 5.3.  4096 PE Ultracomputer.


6.     NYU  Ultracomputer  Prototypes 

The  NYU  Ultracomputer  prototype  effort  began  in  1982  with  the  design  and 
implementation  of  a  two  processor  Ultracomputer  (Ultra  I,  Figure  6.1).  Two  Motorola 
68000  evaluation  boards  each  augmented  with  a  Motorola  68451  memory  management 
unit  were  interfaced  to  a  spare  memory  port  of  a  PUMA  (NYU  prototype  emulator  of  a 
CDC  6600)  /  DEC  PDP11  shared  memory  system.  The  PUMA  was  not  used  but  the 
PDP11  was  used  for  input  /  output.  The  purpose  of  this  prototype  and  those  that 
followed  was  to  provide  a  test  bed  for  operating  system  software  and  user  applications. 
A  highly  parallel  variant  of  the  UNIX  operating  system  was  developed  for  this  and  other 
NYU  prototypes. 


Figure 6.1.  2 PE NYU Ultracomputer Prototype (Ultra I): two Motorola evaluation
boards (68000 with MMU), an 8 MB memory with a 64 bit switch, the PUMA (ECL CDC 6600
emulator), and a backend I/O processor (DEC 11/34) on a 16 bit bus with Decnet.


A more substantial prototype (Ultra II, Figure 6.2) was designed and built during
1983 / 84. Several copies were built. Each Ultra II contained up to eight processors (8
PEs) with cache, an NYU designed Ultra bus and four two megabyte dynamic RAM
memory modules (4 MMs / 8 megabytes) with support for Fetch and Add. The MMs were
dual ported; the second port was again used by a PDP11 for input and output. The Ultra
bus was designed with separate out and in 32 bit data buses to support a simple fast
Fetch and Add operation. The asynchronous Ultra bus also supports byte write
operations and contains an arbiter which arbitrates fairly among the eight PEs, each
running with its own 10 MHz clock. The Ultra bus bandwidth is 8 megabytes per second.


Figure 6.2.  8 PE NYU Ultracomputer Prototype (Ultra II): eight processors with cache
(68010, Weitek, 32KB cache; numbered 0 to 7) on the 32 bit Ultra bus with its arbiter
(ARB), four 2MB dual ported shared memory modules (numbered 0 to 3), and a Unibus
interface to the backend I/O processor (DEC 11/34) on a 16 bit bus with Decnet.


The Ultra II memory modules (MMs) utilized 256 kilobit dynamic RAM chips to
create two 1 megabyte memory arrays per MM. The MMs were dual ported and contained
individual arbiters so that the Ultra and DEC could access separate MMs simultaneously.
A key to the memory design was a 32 bit adder to support Fetch and Add. The single
DEC interface module interfaces the second port of the MMs to the DEC UNIBUS and
maps a 12 kilobyte region of the UNIBUS's 256 kilobyte address space onto the MMs
through 12 individual 1 kilobyte windows. This permits data transfers directly from
PDP11 disk to MM.
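
The window mapping can be pictured as a small table of base addresses. The sketch below is a hypothetical model only: the class name, the region base address, the per-window base registers and the 1 kilobyte alignment rule are all assumptions made for illustration, since the text gives only the window count and size.

```python
WINDOW_SIZE = 1024   # bytes per window, from the text
N_WINDOWS = 12       # number of windows, from the text


class WindowMap:
    """Hypothetical model of the UNIBUS-to-MM window translation."""

    def __init__(self, region_base):
        self.region_base = region_base          # assumed start of the 12 KB UNIBUS region
        self.window_base = [0] * N_WINDOWS      # assumed MM byte address behind each window

    def set_window(self, index, mm_addr):
        if mm_addr % WINDOW_SIZE:
            raise ValueError("window base must be 1 KB aligned (assumption)")
        self.window_base[index] = mm_addr

    def translate(self, unibus_addr):
        """Translate a UNIBUS byte address inside the region to an MM byte address."""
        offset = unibus_addr - self.region_base
        if not 0 <= offset < N_WINDOWS * WINDOW_SIZE:
            raise ValueError("address outside the windowed region")
        index, within = divmod(offset, WINDOW_SIZE)
        return self.window_base[index] + within


if __name__ == "__main__":
    wm = WindowMap(region_base=0x30000)         # illustrative base address
    wm.set_window(2, 0x100000)
    print(hex(wm.translate(0x30000 + 2 * WINDOW_SIZE + 5)))
```

Under this model a disk transfer that targets a window lands directly in the MM location the window currently points at, which is the behavior the paragraph describes.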

The processor design was the most complicated since it had to incorporate many
features. The first important diagnostic feature was the ability to run as a stand alone
processor, with RAM, ROM, and UART, and with the capability of accessing global
memory. A ROM monitor was provided that incorporated a Motorola Assembler /
Disassembler and an NYU memory diagnostic. A Motorola 68451 memory management
unit (MMU) was used and supported by our modified UNIX operating system. Since the
processor has only a 16 bit bus, assembly and disassembly of 32 bit data was provided
to support atomic 32 bit loads, stores and Fetch and Adds. A 32 kilobyte, 2-way set
associative, write through, virtual address cache with least recently used (LRU) eviction
was incorporated to relieve bus congestion. To satisfy user floating point requirements,
Weitek floating point chips were interfaced to the PE.

Since the processor selected (Motorola 68010) did not support Fetch and Add
(FAA), the PE design had to include a mechanism for it. The design chosen was a two
register solution. First the virtual address of the FAA operation was stored in the
Address Register (AR). Then the increment was stored in the Data Register (DR).
Finally, a load was issued referencing a special address. This caused the FAA logic to
direct the stored virtual address in the AR through the MMU to generate a physical
address, which was placed on the address lines of the Ultra bus. Meanwhile the
increment value in the DR was presented on the data out lines. The old value stored in
memory was returned on the data in lines and then to the processor, while the sum of
this value and the increment was stored back into the memory.
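
The three step sequence above can be modeled in software. The following Python sketch is a behavioral illustration only, not a description of the hardware: the class and method names are hypothetical, the MMU is reduced to a caller supplied translation function, and the special address is represented by a method call.

```python
MASK32 = 0xFFFFFFFF  # the memory adder is 32 bits wide


class MemoryModule:
    """Memory with an adder, performing Fetch and Add as one indivisible step."""

    def __init__(self):
        self.cells = {}

    def fetch_and_add(self, phys_addr, increment):
        old = self.cells.get(phys_addr, 0)
        self.cells[phys_addr] = (old + increment) & MASK32
        return old                      # old value goes back on the data in lines


class ProcessingElement:
    """PE side of the two register solution (AR and DR)."""

    def __init__(self, memory, translate):
        self.memory = memory
        self.translate = translate      # stands in for the MMU
        self.ar = 0
        self.dr = 0

    def store_ar(self, virtual_addr):   # step 1: virtual address into AR
        self.ar = virtual_addr

    def store_dr(self, increment):      # step 2: increment into DR
        self.dr = increment & MASK32

    def load_special(self):             # step 3: load from the special address
        phys = self.translate(self.ar)
        return self.memory.fetch_and_add(phys, self.dr)


def fetch_and_add(pe, virtual_addr, increment):
    pe.store_ar(virtual_addr)
    pe.store_dr(increment)
    return pe.load_special()


if __name__ == "__main__":
    mem = MemoryModule()
    pe0 = ProcessingElement(mem, lambda v: v + 0x1000)
    pe1 = ProcessingElement(mem, lambda v: v + 0x1000)
    print(fetch_and_add(pe0, 8, 1), fetch_and_add(pe1, 8, 5), fetch_and_add(pe0, 8, 1))
```

Because the addition happens at the memory, two PEs that issue Fetch and Add to the same location always receive different old values, which is what makes the operation useful for coordination.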

The construction of Ultra II basically consisted of off the shelf integrated circuit
components and two sizes of wirewrapped boards, 9 inch by 9 inch (LB) and 4.5 inch
by 9 inch (SB). Two LBs and two SBs make up one PE. One LB and one SB were used
for each 2 megabyte MM. The LB of the MM contained the memory array, adder and
data paths. It was implemented with printed circuit board technology after first being
wirewrapped and tested. The arbiter and DEC interface boards were LBs. Eight PE
board sets, four MM board sets, an arbiter board, and a DEC interface board plugged
into a drawer with a 22 inch by 15 inch wirewrapped backplane. An LED display front
panel giving status information for each PE and a back panel with two RS232 ports per
PE completed the packaging of Ultra II.




7.    Ultra  III  Prototype 

With the advent of the megabit dynamic RAM and of high performance, true 32 bit
RISC processors with 32 bit buses, caches, and support for multiple outstanding
requests and floating point coprocessors, Ultra II becomes a less viable prototype for a
network Ultracomputer and a new prototype is needed. The Ultra III design must include
a path toward fabrication of larger Ultracomputers. Such a packaging capability is
available with the use of the 'button' technology and ES-Kit modules as described
above. A low risk migration path toward Ultra III, presently in progress, is now discussed.

Three basic modules must be designed: a processor (PE), a switch (SW) and a
memory (MM). To ease testing, an incremental path from Ultra II has been chosen.
First, a synchronous 4 megabyte memory module (MM) utilizing the megabit dynamic
RAM, with a network protocol and Fetch and Op, was designed for Ultra III, and Ultra II
was retrofitted to accept it. The retrofit required a modification to the arbiter board to
packetize Ultra bus requests. This produced a 16 megabyte Ultra II+ (Figure 7.1), which
is now operational and provides users with expanded memory. The second module
implemented was the Ultra III 2 input, 2 output switch module (SW). It utilizes NYU
designed VLSI noncombining switch chips and has been used to connect two Ultra II+
arbiters to two Ultra III MMs. This configuration is now functional as Ultra II+SW
(Figure 7.2).


Figure 7.1.  8 PE NYU Ultracomputer Prototype (Ultra II+): eight processors with cache
(68010, Weitek, 32KB cache; numbered 0 to 7) on the 32 bit bus, dual ported shared
memory built from Ultra III memory modules (1Mb chip), and the backend I/O processor.

Figure 7.2.  NYU Ultracomputer Prototype (Ultra II+SW): two Ultra buses, each with an
arbiter and packetizer, connected through the Ultra III switch (NYU VLSI 2X2 switch) to
two 4MB dual ported shared memory modules (Ultra III memory, 1Mb chip), with a
Unibus interface to the backend I/O processor (DEC 11/34) on a 16 bit bus with Decnet.


The first version of the Ultra III PE was built using the AMD 29000 processor and
supports the Ultra III network protocol. It replaces one of the Ultra II+'s of the Ultra
II+SW prototype. The resulting system is the first version of Ultra III, or Ultra IIIA
(Figure 7.3). It presently executes stand alone programs; the operating system is nearly
ready. The next version of the Ultra III PE is now in design. It will support separate
instruction and data caches using the AMD 29062 cache chips, burst mode, multiple
outstanding requests, and Fetch and Add. Control and network protocol will be
implemented via XILINX programmable gate arrays.

The  goal  of  the  next  phase  of  Ultracomputer  prototypes  is  to  develop  a  16  PE 
wirewrapped  Ultra  III  with  an  omega  interconnection  network  capable  of  combining 
(Figure  7.4)  for  the  purpose  of  testing  network  performance  and  to  facilitate  the 
fabrication  of  larger  Ultracomputers  with  larger  networks  and  substantially  more  PEs 
and  MMs.  The  16  PE  Ultra  III  will  contain  a  four  stage  network  with  32  2X2  combining 
SWs,  and  16  MMs  with  64  megabytes  of  global  memory.  The  success  of  this  design  will 
provide  specifications  for  ES-Kit  modules  from  which  a  64  PE  Ultra  III  can  be  fabricated. 


39 


NYU Ultracomputer Prototype: Ultra IIIA

[Diagram labels: Ultra Bus; Arbiter and Packetizer (ARB & Packet); AMD 29000
(Ultra III Processor); NYU VLSI 2X2 Switch (Ultra III Switch); Dual Ported Shared
Memory, two 4MB banks, 1Mb chip (Ultra III Memory); Backend I/O Processor;
Unibus Interface; DEC 11/34; DECnet; 16 bit Bus.]


Figure  7.3 




16 PE NYU Ultracomputer Prototype: Ultra III (Proposed, wirewrapped)

[Diagram labels: Processors (AMD 29000, numbered 1 through 16); Interconnection
Network (Omega) built from NYU VLSI 2X2 Switches; Dual Ported Shared Memory
(1Mb chip); Backend I/O Processor; Unibus Interface; DEC 11/34; DECnet;
16 bit Bus.]


Figure  7.4 




Appendix 


Ultracomputer 'ES-Kit' Board

[Diagram: board outline with fields of button connector pads, dimensions marked
10 in. and 7 in., and a side view labeled "Stack of Boards and Buttons".]

Note: Standard ES-Kit boards are 7 in. by 5.5 in.
Top conductor pads need not connect to
bottom conductor pads.


Figure  A1 




Cinch Button Connector

[Diagram: views of the connector showing the pin field, with dimensions marked
0.5", 3.5", and A = 0.5", 0.75", or 1.1".]


Notes: 

Each button connector has 106 signal pins with controlled impedance,
28 ground pins, and 8 power pins. Each connector also has two alignment pins
and three bolt holes.

The button contacts are fabricated from 2 mil diameter strands of wire
randomly pressed into a cylinder. Contact is made between button and PC
board 'land' by pressure (1-2 ounces per button).

Connectors are made by pressing buttons into rods that serve as
spacers between stacked boards. Every rod has a button on both ends.


Figure  A2 
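
The pin counts and contact pressure in the notes to Figure A2 give a rough estimate of the clamping force one connector needs against a board face. The sketch below assumes that each of the signal, ground, and power pins is one button contact on that face and that the alignment pins and bolts carry no contact force; both assumptions and the function names are illustrative only.

```python
SIGNAL_PINS = 106
GROUND_PINS = 28
POWER_PINS = 8
OUNCES_PER_POUND = 16


def total_pins():
    """Button contacts per connector face (assumes one button per pin)."""
    return SIGNAL_PINS + GROUND_PINS + POWER_PINS


def connector_contact_force_lb(force_per_button_oz):
    """Total contact force in pounds for one connector face."""
    return total_pins() * force_per_button_oz / OUNCES_PER_POUND


print(total_pins())                      # 142
print(connector_contact_force_lb(1))     # 8.875
print(connector_contact_force_lb(2))     # 17.75
```

Under these assumptions, the 1-2 ounces per button quoted in the notes correspond to roughly 9 to 18 pounds per connector face.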




References 


[1]  R.  Bianchini,  R.  Bianchini,  Jr.,  "Wireability  of  an  Ultracomputer,"  Ultracomputer 
Note  #43,  Courant  Institute,  Ultracomputer  Research  Laboratory,  NYU,  October  1982. 

[2]  A.  Gottlieb,  NYU,  "An  Overview  of  the  NYU  Ultracomputer  Project,"  Experimental 
Parallel  Computing  Architectures,  J.  Dongarra,  ed.,  North  Holland,  pp.  25-95,  1987. 

[3]  R.  Smolley,  "Button  Board,  A  New  Technology  Interconnect  for  2  and  3 
Dimensional  Packaging,"  Proceedings  of  ISHM  (International  Society  for  Hybrid 
Microelectronics),  Disneyland  Hotel,  Anaheim,  California,  November  1985. 

[4] R. Smolley, "VHSIC Phase 2 Packaging Work Shop,"
Inn at the Park, Houston, Texas, May 1987.

[5] R. J. Smith, II, "ES-Kit Project Overview," MCC Technical Report # ACA-ESP-391-87,
December 1987.

[6] R. J. Smith, II, "ES-Kit Hardware Technical Memoranda (1H88)," MCC Technical
Report # ACA-ESP-137-88, April 1988.

[7]  S.  Dickey,  R.  Kenner,  M.  Snir,  and  J.  Solworth,  "A  VLSI  Combining  Network  for 
the  NYU  Ultracomputer,"  Proceedings  of  the  IEEE  International  Conference  on 
Computer  Design,  pp.  110-113,  1985. 

[8] S. Dickey, A. Gottlieb and R. Kenner, "Using VLSI to Reduce Serialization and
Memory Traffic in Shared Memory Parallel Computers," Advanced Research in VLSI:
Proceedings of the Fourth MIT Conference on VLSI, C. E. Leiserson, ed., pp. 299-316,
1986.

[9] J. A. Fulton, D. R. Horton, R. C. Moore, W. R. Lambert, S. Jin, R. L. Opila, R. C.
Sherwood, T. H. Tiefel, J. J. Mottine, "Electrical and Mechanical Properties of a
Metal-Filled Polymer Composite for Interconnection and Testing Applications," 39th
Electronic Component Conference, Houston, Texas, May 1989.




